Elsevier Hybrid Open Access Analysis

Publishers rarely make invoice data for open access publication fees transparent. Elsevier is a remarkable exception. The publisher provides machine-readable data about open access funding, including funding organisations and fee waivers. This blog post demonstrates how to mine such data with R. Analysing the resulting dataset of 70,000 hybrid open access articles published in around 1,800 Elsevier journals between 2015 and now reveals a growth of centrally paid publication fees. Nevertheless, the majority of funding sources remain unknown, raising important questions about the transparency of fee-based open access publishing.

Najko Jahn (https://twitter.com/najkoja), State and University Library Göttingen (https://www.sub.uni-goettingen.de/)
Aug 29, 2019

In September 2018, cOAlition S, a group of international research funders, announced its widely discussed Plan S. According to its principles, publication fees that may arise when publishing open access should be covered by funders or research organizations directly. Although surveys (Solomon and Björk 2011; Dallmeier-Tiessen et al. 2011) suggest that many authors already do not pay publication fees themselves, monitoring these funding streams is challenging because publishers rarely share invoice data. Nor do all funders and research organizations report open access payments publicly. The result is an opaque situation in which policy-making and analytics lack essential data about transitioning subscription journals to open access.

This blog post presents a dataset comprising publicly available hybrid open access sponsorship information from Elsevier, a major publisher of scholarly journals. The dataset, created using metadata from Crossref and by mining open access full-texts, serves as critical input to ongoing discussions around transitioning subscription journals to open access. The methods used to obtain the data address key challenges in discovering hybrid open access articles, along with funding and affiliation information, using open tools and services. Elsevier’s effort to make invoice information openly available also serves as a good-practice example of how publishers can make open access funding more transparent.

To demonstrate its potential, the dataset is used to analyse the number and proportion of hybrid open access articles in Elsevier journals. Drawing on Elsevier’s funding information, I also investigate whether publication fees were billed to authors or to funders that had made an agreement with Elsevier, or whether the fees were waived. Moreover, text-mined author email domains provide a rough approximation of the affiliation of the first or corresponding author, an important data point for delineating open access funding.

The resulting dataset is openly available on GitHub along with the source code.

Methods

As a start, I used the Elsevier publication fee price list, shared as a PDF document, to identify hybrid open access journals in Elsevier’s journal portfolio. The rOpenSci tabulizer package made it possible to extract data about these journals from this file.

Following the Hybrid Open Access Journal Dashboard, an interactive analytical application from the SUB Göttingen, I interfaced with the Crossref REST API using the R package rcrossref. The first API call retrieved facet field counts for license URLs and the yearly article volumes for the period 2015 to 2019 for every journal. After matching license URLs indicating open access articles, a second API call checked license metadata per journal. Here, the Crossref REST API filters license.url and license.delay made it possible to exclude delayed open access articles. For every immediate open access article, Crossref provides metadata including full-text links.
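The two API calls described above can be sketched as plain REST request URLs. This is a minimal sketch, not the code behind the dataset: the helper names `journal_works()`, `facet_url()` and `license_url()` are hypothetical, the ISSN is only an example value, and the CC BY license URL stands in for whichever license URLs were matched as open access. The `filter`, `facet` and `rows` parameters and the `license.url` / `license.delay` filters are those of the public Crossref REST API, which rcrossref wraps.

```r
# Crossref REST API route for works published in one journal
journal_works <- function(issn) {
  paste0("https://api.crossref.org/journals/", issn, "/works")
}

# Step 1: facet counts for license URLs and publication years, 2015 to 2019.
# rows=0 asks for the facets only, without article records.
facet_url <- function(issn) {
  paste0(
    journal_works(issn),
    "?filter=from-pub-date:2015-01-01,until-pub-date:2019-12-31,type:journal-article",
    "&facet=license:*,published:*",
    "&rows=0"
  )
}

# Step 2: articles carrying a given license without delay.
# license.delay:0 keeps only licenses that take effect no later than the
# publication date, which drops delayed open access articles.
license_url <- function(issn, license) {
  paste0(
    journal_works(issn),
    "?filter=license.url:", utils::URLencode(license, reserved = TRUE),
    ",license.delay:0,type:journal-article",
    "&rows=100"
  )
}

issn <- "1475-1585"  # example ISSN, used for illustration only
cc_by <- "http://creativecommons.org/licenses/by/4.0/"

step1 <- facet_url(issn)
step2 <- license_url(issn, cc_by)
print(step1)
print(step2)
```

Building the URLs separately from sending them keeps the query logic easy to inspect; the strings can then be passed to any HTTP client, or the same filters and facets can be handed to rcrossref.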

Elsevier provides access to full-texts as HTML and XML documents via the Crossref Text and Data Mining Services (Crossref-TDM). Surprisingly, the XML representation contains not only the full-text, but also embedded metadata, including information about open access sponsorship in the <core> node:


<openaccess>1</openaccess>
<openaccessArticle>true</openaccessArticle>
<openaccessType>Full</openaccessType>
<openArchiveArticle>false</openArchiveArticle>
<openaccessSponsorName>
  Arts and Humanities Research Council
</openaccessSponsorName>
<openaccessSponsorType>FundingBody</openaccessSponsorType>
<openaccessUserLicense>
  http://creativecommons.org/licenses/by/4.0/
</openaccessUserLicense>

Snapshot of open access metadata in an Elsevier XML full-text. https://api.elsevier.com/content/article/pii/S1475158518302261
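As a rough illustration of how these fields could be read, the following sketch parses the snapshot with the xml2 package. It assumes the elements sit inside a `<core>` node without XML namespaces; real API responses may declare namespaces, in which case something like `xml2::xml_ns_strip()` might be needed first. The helper `oa_field()` is a hypothetical name, not part of any package.

```r
library(xml2)

# The snapshot from above, wrapped in its <core> parent node
xml_txt <- '<core>
  <openaccess>1</openaccess>
  <openaccessArticle>true</openaccessArticle>
  <openaccessType>Full</openaccessType>
  <openArchiveArticle>false</openArchiveArticle>
  <openaccessSponsorName>
    Arts and Humanities Research Council
  </openaccessSponsorName>
  <openaccessSponsorType>FundingBody</openaccessSponsorType>
  <openaccessUserLicense>
    http://creativecommons.org/licenses/by/4.0/
  </openaccessUserLicense>
</core>'

doc <- read_xml(xml_txt)

# Return the trimmed text of the first matching element, or NA if absent
oa_field <- function(doc, name) {
  node <- xml_find_first(doc, paste0(".//", name))
  trimws(xml_text(node))
}

fields <- c(
  "openaccessType",
  "openaccessSponsorName",
  "openaccessSponsorType",
  "openaccessUserLicense"
)

# Named character vector with one entry per field
oa_info <- vapply(fields, function(f) oa_field(doc, f), character(1))
print(oa_info)
```

Trimming matters here because the sponsor name and license URL are surrounded by line breaks and indentation in the source XML.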

After downloading the Elsevier full-texts with the crminer package, a client maintained by rOpenSci, I extracted the open access information highlighted above from the XML documents.

Moreover, I parsed the first author email address, assuming that email domains roughly indicate the affiliation of the first or corresponding author at the time of publication. The urltools package made it possible to extract email domains and to split them into meaningful parts.
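The idea of turning email addresses into domain parts can be sketched in base R. Note that this is a simplified stand-in, not the urltools-based approach used for the dataset: the split below naively treats the last label as the top-level domain and does not handle multi-part suffixes such as "ac.uk". The email addresses and the helper names `email_domain()` and `split_domain()` are hypothetical.

```r
# Hypothetical email addresses, not taken from the dataset
emails <- c(
  "a.author@physics.example-university.de",
  "b.author@example.com",
  "c.author@lab.example-university.de",
  NA
)

# Everything after the last "@", lower-cased; NA if there is no "@"
email_domain <- function(x) {
  dom <- tolower(sub("^.*@", "", x))
  dom[!grepl("@", x, fixed = TRUE)] <- NA_character_
  dom
}

# Naive split: the last label is taken as the top-level domain and the
# label before it as the domain name. Multi-part suffixes are not handled.
split_domain <- function(dom) {
  labels <- strsplit(dom, ".", fixed = TRUE)
  data.frame(
    host = dom,
    name = vapply(
      labels,
      function(l) if (length(l) > 1) l[length(l) - 1] else NA_character_,
      character(1)
    ),
    tld = vapply(labels, function(l) l[length(l)], character(1)),
    stringsAsFactors = FALSE
  )
}

domains <- split_domain(email_domain(emails))

# Number of addresses per top-level domain (missing values are dropped)
tld_counts <- table(domains$tld)
print(domains)
print(tld_counts)
```

Counts per top-level domain and per domain name are the two levels a hierarchical view of affiliations would need.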

Dataset characteristics and availability

The resulting dataset comprises the following variables, and is openly shared via GitHub.

First ten rows


library(rmarkdown)
# Load the normalized dataset from the data/ folder
hybrid_df <- readr::read_csv("data/els_hybrid_info_normalized.csv")
# Show the first ten rows as a paged table
paged_table(head(hybrid_df, 10))

It must be noted, however, that the open access information in Elsevier full-texts was not documented at the time of writing this blog post.

Results

In total, the dataset comprises 63,577 hybrid open access articles from 1,703 hybrid open access journals published between January 2015 and July 2019.

What is the uptake of hybrid open access among Elsevier journals?

Using this dataset, the share of hybrid open access articles per journal was calculated. To explore variations among journals, Bob Rudis’ ggeconodist package was used. The package does a great job of replicating the boxplot aesthetics of The Economist magazine.

The figure shows a slow but steady hybrid open access uptake. The median open access proportion was around 3% in the first seven months of 2019. Of the 1,985 Elsevier subscription journals offering hybrid open access, 1,703 did in fact publish at least one article under this model, corresponding to a share of 86%.
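The two measures used here, the open access share per journal and the share of journals with any uptake, can be illustrated with a short calculation. The per-journal counts, ISSNs and column names below are hypothetical and only show the arithmetic; the journal totals of 1,703 and 1,985 are the figures reported above.

```r
# Hypothetical per-journal counts for one year (not taken from the dataset)
journals <- data.frame(
  issn = c("0000-0001", "0000-0002", "0000-0003"),
  year = 2019,
  oa_articles = c(3, 0, 12),
  all_articles = c(120, 80, 150)
)

# Open access proportion per journal and its median across journals
journals$oa_share <- journals$oa_articles / journals$all_articles
median_share <- median(journals$oa_share)

# Share of hybrid journals that published at least one open access article,
# using the journal counts reported in the text
journals_with_oa <- 1703
journals_offering <- 1985
uptake <- journals_with_oa / journals_offering

print(round(100 * median_share, 1))
print(round(100 * uptake, 1))  # 85.8, reported as 86% in the text
```

Using the median rather than the mean keeps a few journals with very high open access shares from dominating the summary.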

Hybrid open access facilitated by institutional agreements with Elsevier

Elsevier usually requires authors to pay a publication fee, also known as an article processing charge (APC), to publish open access. Many authors make use of funding from grant agencies or academic institutions to cover such fees. To streamline this process, some funding bodies and institutions have agreed on central payment options for affiliated researchers. Elsevier also provides APC waivers.

In most cases (59%), payment notifications were sent to the authors, who paid directly. Elsevier lists a funding body covering the open access publication fee for around one third of articles.

The following interactive visualization lets you browse the funders disclosed by Elsevier.

Mostly British and Dutch funders sponsored hybrid open access in Elsevier journals. The German Federal Ministry of Education and Research (BMBF) is also well represented, despite the current boycott by most universities and research organizations in Germany. According to the publisher, the BMBF has financially supported 152 hybrid open access articles in 110 Elsevier journals since 2018.

Author affiliation

In addition to funding information, email domains were parsed from Elsevier full-texts. These domains roughly indicate the affiliation of the first or corresponding author, a data point used to delineate open access funding. The following hierarchical, interactive treemap visualizes the distribution of the email domains. Each top-level domain can be subdivided further into domain names representing academic institutions or companies. The size of each rectangle is proportional to the number of hybrid open access articles corresponding to this domain.

Discussion and conclusion

Dallmeier-Tiessen, Suenje, Robert Darby, Bettina Goerner, Jenni Hyppoelae, Peter Igo-Kemenes, Deborah Kahn, Simon C. Lambert, et al. 2011. “Highlights from the SOAP Project Survey. What Scientists Think About Open Access Publishing.” http://arxiv.org/abs/1101.5260.

Solomon, David J., and Bo-Christer Björk. 2011. “Publication Fees in Open Access Publishing: Sources of Funding and Factors Influencing Choice of Journal.” Journal of the Association for Information Science and Technology 63 (1). Wiley-Blackwell: 98–107. https://doi.org/10.1002/asi.21660.